# GNN-assisted Back-side Clock Routing Methodology for Advance Technologies

Nesara Eranna Bethur<sup>1</sup>, Pruek Vanna-Iampikul<sup>1</sup>, Odysseas Zografos<sup>2</sup>, Lingjun Zhu<sup>1</sup>, Guiliano Sisto<sup>2</sup>, Dragomir Milojevic<sup>2</sup>, Alberto García-Ortiz<sup>3</sup>, Geert Hellings<sup>2</sup>, Julien Ryckaert<sup>2</sup>, Francky Catthoor<sup>2</sup>, and Sung Kyu Lim<sup>1</sup> <sup>1</sup>Georgia Institute of Technology, Atlanta, USA; <sup>2</sup>IMEC, Leuven, Belgium; <sup>3</sup>Universität Bremen, Bremen, Germany; Email: nbethur3@gatech.edu

## **ABSTRACT**

The back-side metal layers exhibit lower parasitics than the front-side layers in advanced technologies, making them suitable for clock-net distribution. In this study, we explore the advantages of using back-side metal layers, which are shared with the power delivery network, for clock routing. Our Graph Neural Network (GNN) based framework effectively distributes the clock tree between the front and back sides. We enable the creation of back-side clock nets by incorporating back-side buffers. Our results demonstrate better clock and full-chip metrics, with an increase of up to 13% in effective frequency at equivalent power consumption in a 3 nm technology.

#### **ACM Reference Format:**

Nesara Eranna Bethur¹, Pruek Vanna-Iampikul¹, Odysseas Zografos², Lingjun Zhu¹, Guiliano Sisto², Dragomir Milojevic², Alberto García-Ortiz³, Geert Hellings², Julien Ryckaert², Francky Catthoor², and Sung Kyu Lim¹. 2024. GNN-assisted Back-side Clock Routing Methodology for Advance Technologies. In 61st ACM/IEEE Design Automation Conference (DAC '24), June 23–27, 2024, San Francisco, CA, USA. 6 pages. https://doi.org/10.1145/3649329.3657333

## 1 INTRODUCTION

The metal stack for the VLSI interconnect generally consists of Mx, My, and Mz metal groups, as shown in Figure 1. As the logic size shrinks, the pitches of the Mx layer in the metal stack are reduced, resulting in increased resistance. Given this scenario, the EDA tools need to maximize signal routing in the My/Mz layers of the metal stack. The situation worsens because the My/Mz layers are shared between the signal, clock, and power delivery network. Therefore, from a high-performance perspective, an ideal solution would be to have a contention-free space for signal routing.

Another pressing problem in advanced nodes is the effect of signal integrity (SI). The wire pitches are reduced to support logic scaling, leading to increased coupling capacitance between the wires. Solutions such as shielding and spacing have been proposed to improve SI with the trade-off of increasing routing congestion. A better utilization of metal resources would be available if we could reduce layer sharing between clock and signal routing.


DAC '24, June 23–27, 2024, San Francisco, CA, USA © 2024 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-0601-1/24/06.






Figure 1: The conventional front-side vs. new back-side clock delivery (cross-section view) using nano-TSV technology [1].

Back-side Power Delivery Network (BS-PDN) [2, 3] is a power delivery scheme in which power is distributed through back-side metal layers. In advanced technologies, nano-scale Through-Silicon Vias (nTSVs) are used to support BS-PDN. This methodology isolates signal routing from power delivery, yielding designs that are less congested, more resilient to voltage drop, and higher performing.

The back-side metals not only facilitate power distribution but can also offer a platform for the delivery of other global signals [4], including the clock, due to their low parasitic delays. Relocating the clock to the back-side can alleviate signal congestion on the front-side, resulting in enhanced performance<sup>1</sup>. For these reasons, clock delivery is the second step towards exploiting the back-side, after power delivery. Taking inspiration from BS-PDN [1], where power is delivered through the metal layers on the back-side of the substrate, we propose a novel methodology for intelligently partitioning the clock between the front-side and the back-side metal layers while optimizing the clock and full-chip metrics. We call this Back-side-CDN (BS-CDN). Our contributions are as follows:

<sup>1</sup>A significant side-effect of BS-PDN is that power signals are no longer readily available on the front-side to be used for shielding. Therefore, signal integrity is one of the main reasons to move the clock distribution network to the back-side. Having power and clock signals in the same low-noise plane should improve signal integrity metrics compared to implementations featuring only BS-PDN. To the best of the authors' knowledge, no work has been published assessing the signal integrity properties of back-side metals, and this assessment is beyond the scope of this work.

Table 1: Technology specifications used in this work are based on [6]. The nano-TSV assumption is based on [1].

| Metal details | Group | Layer range | Width (μm) | R (Ω) |
|---------------|-------|-------------|------------|-------|
| Front-side    | Mz    | M6 - M7     | 0.024      | 44    |
| Front-side    | My    | M4 - M5     | 0.018      | 101   |
| Front-side    | Mx    | M1 - M3     | 0.012      | 347   |
| Back-side     | Mb    | MB1 - MB2   | 0.061      | 34    |

| Via details     | Group       | Diameter (μm) | Pitch (μm) | R (Ω) |
|-----------------|-------------|---------------|------------|-------|
| clock-TSV       | M1-to-MB1   | 0.09          | 0.18       | 10    |
| power-TSV       | MBPR-to-MB1 | 0.02          | 0.12       | 5     |
| back-side metal | MB1-to-MB2  | 0.040         | 0.08       | 4     |

- To the best of our knowledge, this is the first study that conducts detailed and accurate design and Performance-Power-Area (PPA) analysis for back-side clock delivery and compares it with its front-side counterparts.
- We propose a novel and systematic methodology that makes the best use of metal layers on both sides to optimize clock and full-chip metrics. We also develop clever ways to integrate our approach into commercial flows to ensure that all designs are of commercial sign-off quality.
- We present an unsupervised GNN learning solution to classify the endpoints (EPs) that need to be serviced by the back-side clock nets to achieve better performance and reduce the nTSV count. We then introduce a clock-tree partitioning algorithm that moves selected clock nets to the back-side based on GNN-assisted endpoint clustering.

## 2 BACK-SIDE CLOCK BACKGROUND

#### 2.1 Back-side Processing

The main process steps (at wafer level) involved in fabricating back-side metals are: (i) wafer-to-wafer bonding of the target wafer, after front-side processing is done, to a carrier wafer; (ii) flipping the bonded wafer to expose the back-side of the target wafer; (iii) back-side wafer grinding and thinning to sub-micron thickness [5]. State-of-the-art back-side processing has been shown to achieve the fabrication of wide and thick metal layers that exhibit very low resistivity values [2]<sup>2</sup>. These metal layers present an evident opportunity for migrating the PDN of a design to the back-side. Furthermore, the usage of these layers (assuming modest scaling of metal layer pitches) can be extended to distribute the clock networks of an SoC.

#### 2.2 Technology Specifications

An in-house 3 nm PDK was developed, incorporating buried power rails (BPRs) and back-side metal layers for PDN and clock routing, as outlined in [6] and Table 1. The nTSV (power-TSV) connects MB1 to MBPR for power distribution, while a distinct nTSV (clock-TSV) links MB1 to M1 for clock routing. The two nTSVs have different dimensions and resistances.



Figure 2: Back-side buffer schematics and layout.

Table 2: Back-side buffer comparison with the front-side counterpart (front). Delay and power are in ps and pW.

| Metric          | X1 front | X1 BS-OUT | X1 BS-IN | X6 front | X6 BS-OUT | X6 BS-IN |
|-----------------|-------|--------|-------|-------|--------|-------|
| Delay - Rise    | 4.227 | 7.796  | 4.226 | 3.200 | 3.950  | 3.188 |
| Delay - Fall    | 4.450 | 7.610  | 4.450 | 3.650 | 4.501  | 3.635 |
| Trans - Rise    | 3.580 | 11.110 | 3.580 | 1.549 | 2.828  | 1.546 |
| Trans - Fall    | 2.900 | 8.630  | 2.900 | 1.513 | 2.809  | 1.511 |
| Int. Pwr - Rise | 0.026 | 0.270  | 0.026 | 0.151 | 0.404  | 0.151 |
| Int. Pwr - Fall | 0.086 | 0.088  | 0.086 | 0.489 | 0.502  | 0.490 |
| Input cap (pF)  | 0.240 | 0.240  | 1.239 | 1.393 | 1.393  | 2.393 |

To perform clock routing on the back-side metal layers, we create a new set of buffer cells with a back-side contact at MB1 by integrating the nTSV inside the buffer cells. The back-side buffers come in two variants: a back-side out buffer (BS-OUT) and a back-side in buffer (BS-IN). The name indicates where the nTSV is inserted. BS-OUT has the nTSV at the output pin, which is used to propagate the signal to the back-side layers, while BS-IN has the nTSV at the input pin. We characterized the cells using the Non-Linear Delay Model (NLDM). A summary of the key results is shown in Table 2. For the BS-IN buffer, the main effect is an increase in the input capacitance, while the output delay is not affected. For the BS-OUT buffer, the input capacitance is unaffected while the output delay increases. At higher drive strengths such as X6, the additional delay introduced by the nTSV within these back-side buffers decreases, and in some cases (BS-IN at X6) the back-side buffer is marginally faster than the front-side buffer. Since clock trees generally use buffers with high drive strength, this works in our favor. Note that we currently do not optimize the clock buffers; this additional optimization would further increase the advantages of our approach.
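The drive-strength trend described above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the rise-delay values from Table 2; the helper name `overhead_pct` is hypothetical and introduced for illustration.

```python
# Rise delays in ps, copied from Table 2.
RISE_DELAY_PS = {
    "X1": {"front": 4.227, "BS-OUT": 7.796, "BS-IN": 4.226},
    "X6": {"front": 3.200, "BS-OUT": 3.950, "BS-IN": 3.188},
}


def overhead_pct(drive: str, variant: str) -> float:
    """Relative rise-delay change of a back-side buffer versus the front-side cell."""
    ref = RISE_DELAY_PS[drive]["front"]
    return 100.0 * (RISE_DELAY_PS[drive][variant] - ref) / ref


for drive in RISE_DELAY_PS:
    for variant in ("BS-OUT", "BS-IN"):
        print(f"{drive} {variant}: {overhead_pct(drive, variant):+.1f}%")
```

The BS-OUT overhead is much smaller at X6 than at X1, while the BS-IN delay stays essentially equal to that of the front-side cell at both drive strengths.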

#### 2.3 Back-side Power Distribution

The utilization of back-side layers extends beyond clock distribution to include power delivery. The back-side power delivery network (PDN) is implemented first with a reasonable IR-drop, while spare routing tracks are allocated for clock delivery. The buried power rails (BPRs) are used for the power and ground pins of the standard cells. Power and ground (PG) stripes are established in the back-side layers (MB2 - MB1). After forming the PG stripes, core rails are developed, with power nTSVs facilitating connections between the back-side layers and MBPR.

<sup>2</sup>In the referenced paper, a Cu single damascene layer (the same as those used in older CMOS nodes, e.g. 65 nm) is used for back-side interconnections with a thickness of >100 nm, which would result in very low resistivity lines.

Figure 3: Back-side power delivery on a processor benchmark.

We choose the width and pitch of the back-side stripes to achieve a reasonable IR-drop (< 10% of VDD), as shown in Table 4, and allocate the rest of the area to clock routing. This ensures that the area needed to meet the IR-drop specification is reserved in the BS-CDN design. Figure 3 illustrates the overall PDN structure of the BS-CDN design in this paper. Table 4 reports the PDN usage as well as the IR-drop data in our designs.

## 3 BACK-SIDE CLOCK ROUTING

#### 3.1 Problem Definition

With advanced node scaling leading to finer lines and spacing, the impact of interconnect parasitics becomes increasingly critical. Using back-side metal layers for sensitive nets, such as the clock, is attractive due to their lower parasitics. Currently, back-side layers are deployed to decouple power delivery from signal routing, which relieves the routing congestion caused by power grids. However, EDA tools lack support for cross-layer connections between front-side and back-side metals. In this work, we summarize the problem definitions as follows:

#### Problem 1: (Back-side clock enablement)

Given the process design kit (PDK) with back-side PDN utilization, our goal is to enable commercial EDA support for the cross-layer connection between front-side and back-side metal layers to utilize the remaining space in the MB for clock routing.

#### Problem 2: (GNN-assisted back-side clock routing)

Our observations show that naively servicing all the timing critical endpoints does not produce the best results as seen in Table 4. So given the initial clock tree structure of a design, our goal is to predict the endpoints to efficiently assign the back-side clock routing to enhance the clock quality and final PPA metrics.

#### 3.2 Overview of the Approach

The solution we propose for the aforementioned concerns is a methodology that enables the industry standard PnR tool to partition the clock between the back-side and front-side layers with an intent to optimize the clock and full-chip metrics.

Figure 4 outlines our contributions within the commercial physical design flow. Clock nets are categorized into leaf nets, which directly connect to endpoints (EPs), and trunk nets, which encompass all other clock nets. After constructing the clock tree, we apply an algorithm for back-side clock routing. This process uses an unsupervised Graph Neural Network (GNN) approach to identify the EPs that need back-side clock routing. Subsequently, an algorithm determines the nets that connect to these EPs and transitions them to back-side nets by inserting back-side buffers. These buffers provide the cross-layer connections. While EPs are prioritized for back-side clock routing, some EPs may receive back-side nets without directly needing them according to our GNN analysis. Note that leaf clock nets are not transitioned to the back-side because they require a finer pitch, which may not be available in the back-side metal layers. After the back-side nets are allocated, the flow continues with the remaining physical design (PD) stages for clock and signal routing.

Figure 4: Our proposed back-side clock routing methodology flow. We use the back-side for both clock and power delivery.

Figure 5: Our GNN-based back-side clocking enablement.

#### 3.3 GNN Building and Training

To capture all the information of a clock tree for the graph learning process, we build a graph that contains not only the clock sinks (endpoints) as nodes but also edges that represent the timing paths between them. We call this a GNN sink graph, which is shown in Figure 5.
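As an illustration of the sink graph, the sketch below builds an adjacency map from a list of timing paths. It assumes that each timing path is given as a (launch endpoint, capture endpoint) pair and that edges are treated as undirected; the function name `build_sink_graph` and the endpoint names are hypothetical.

```python
from collections import defaultdict


def build_sink_graph(timing_paths):
    """Return {endpoint: set of endpoints that share a timing path with it}."""
    adj = defaultdict(set)
    for launch_ep, capture_ep in timing_paths:
        # Make sure both endpoints exist as nodes, even if they end up isolated.
        adj.setdefault(launch_ep, set())
        adj.setdefault(capture_ep, set())
        if launch_ep != capture_ep:
            adj[launch_ep].add(capture_ep)
            adj[capture_ep].add(launch_ep)
    return dict(adj)


# Hypothetical timing paths between four flip-flops.
paths = [("ff_a", "ff_b"), ("ff_b", "ff_c"), ("ff_a", "ff_c"), ("ff_d", "ff_d")]
graph = build_sink_graph(paths)
```

Here `ff_d` only has a path to itself, so it appears as a node without neighbors.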

For GNN graph training, we use GraphSAGE [7] with the loss function in Equation 3 to identify endpoints that are not only timing critical but also similar in terms of clock and timing attributes. Let the graph be G = (V, E), where V denotes all the endpoints and E represents all the timing paths between endpoints. We define the node set  $V = \{v_1, \dots, v_n\}$, where each node carries the initial features defined in Section 3.4. Since the endpoints are connected through the clock tree, changing the clock routing for one endpoint affects other endpoints. Therefore, we perform neighbor encoding on graph G to aggregate the neighbor information for all endpoints before clustering them to assign the back-side clock routing in the next step. Let k denote the level of updated information. The update process at the  $k^{th}$  level is as follows:

$$h_{N(v)}^{k-1} = reduce\_mean(\{W_k^{agg} h_u^{k-1}, \forall u \in N(v)\}) \tag{1}$$

$$h_v^k = \sigma(W_k^{proj} \cdot concat[h_v^{k-1}, h_{N(v)}^{k-1}]) \tag{2}$$

where  $\sigma$  is the sigmoid function,  $h_v^k$  denotes the representation vector of node v at level k (with  $h_v^0$  given by the initial features of v), N(v) denotes the neighbors sampled at k-hop, and  $W_k^{agg}$  and  $W_k^{proj}$  denote the aggregation and projection matrices respectively, which together form the neural layer at level k. For each iteration, we minimize the loss function to capture the similarity of endpoints. The loss function is defined in Equation 3:

$$\mathcal{L}(h_v) = -\sum_{u \in N(v)} \log(\sigma(h_v^\top h_u)) - \sum_{i=1}^M \mathbb{E}_{n_i \sim Neg(v)} \log(\sigma(-h_v^\top h_{n_i})) \tag{3}$$

where Neg(v) represents the negative samples drawn with respect to node v, and M represents the negative sampling size. Negative sampling is performed by randomly sampling nodes that lie beyond the local neighborhood of the target node v, and we aim to minimize the similarity between the negatively sampled nodes and the target node. Given our sampling size, the expectation is approximated by the sample mean. Minimizing the loss function drives EPs with similar objective features (described in the next subsection) to be clustered together so that they are serviced by the back-side clock.
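To make the update and loss concrete, here is a minimal forward-pass sketch of one level (Equations 1 and 2) and the per-node loss (Equation 3) in plain Python. It is not the training code used in this work: the weights are random, no gradient step is shown, isolated nodes are given a zero neighborhood vector, and negatives are drawn uniformly from non-neighbors. The toy graph and the names `sage_level` and `unsup_loss` are assumptions for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sage_level(h, adj, W_agg, W_proj):
    """One level: mean-aggregate neighbors (Eq. 1), then project the concatenation (Eq. 2)."""
    h_next = {}
    for v, h_v in h.items():
        msgs = [matvec(W_agg, h[u]) for u in adj[v]]
        if msgs:
            h_nbr = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            h_nbr = [0.0] * len(W_agg)  # assumption: isolated node gets a zero vector
        h_next[v] = [sigmoid(z) for z in matvec(W_proj, h_v + h_nbr)]
    return h_next


def unsup_loss(h, v, adj, num_neg, rng):
    """Eq. 3 for node v, with the expectation replaced by num_neg random negatives."""
    pos = -sum(math.log(sigmoid(dot(h[v], h[u]))) for u in adj[v])
    candidates = [n for n in h if n != v and n not in adj[v]]
    negs = [rng.choice(candidates) for _ in range(num_neg)] if candidates else []
    neg = -sum(math.log(sigmoid(-dot(h[v], h[n]))) for n in negs)
    return pos + neg


rng = random.Random(0)
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [0.5, 0.5]}
W_agg = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]   # 2 x 2
W_proj = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 3 x (2 + 2)
h1 = sage_level(h0, adj, W_agg, W_proj)
loss_a = unsup_loss(h1, "a", adj, num_neg=2, rng=rng)
```

In a real flow the two matrices would be trained by minimizing this loss over all nodes, typically with an automatic-differentiation framework.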

#### 3.4 Feature Selection for the GNN

We pick the features shown in Table 3. From a high-level perspective, the benefits of the back-side clock come from two sources: reduced signal congestion and better clock quality. Accordingly, we select timing slack, path delay, and cell/net delay in the path to denote timing criticality; path depth and total fanout to describe the path structure; and launch and capture clock latency as clock features. With these features, our aim is to differentiate the EPs whose timing needs to be fixed through clock latency.
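The sketch below shows one way the eight features of Table 3 could be turned into initial node vectors. Per-feature standardization is an assumption made here so that features with different units are comparable; the text does not state how features are scaled. The endpoint names and numeric values are hypothetical.

```python
from statistics import mean, pstdev

# Feature names follow Table 3.
FEATURES = [
    "timing_slack", "total_cell_delay", "total_net_delay", "path_delay",
    "path_depth", "total_fanout", "launch_clock_latency", "capture_clock_latency",
]


def standardize(endpoints):
    """Return {endpoint: feature vector} with zero mean and unit variance per feature."""
    cols = {f: [ep[f] for ep in endpoints.values()] for f in FEATURES}
    # A constant feature has zero spread; divide by 1.0 instead to avoid division by zero.
    stats = {f: (mean(c), pstdev(c) or 1.0) for f, c in cols.items()}
    return {
        name: [(ep[f] - stats[f][0]) / stats[f][1] for f in FEATURES]
        for name, ep in endpoints.items()
    }


# Hypothetical raw values (delays and latencies in ps).
raw = {
    "ff_a": dict(zip(FEATURES, [-30.0, 400.0, 150.0, 550.0, 20, 45, 270.0, 260.0])),
    "ff_b": dict(zip(FEATURES, [10.0, 300.0, 90.0, 390.0, 14, 30, 265.0, 268.0])),
    "ff_c": dict(zip(FEATURES, [-5.0, 350.0, 120.0, 470.0, 17, 45, 272.0, 262.0])),
}
x0 = standardize(raw)
```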

#### 3.5 Endpoint Clustering

Post-GNN training, node embedding information is acquired, summarizing the data of adjacent endpoints. The goal is to discern endpoint similarities using clustering algorithms, specifically employing weighted K-means to divide endpoints into two categories: front-side routing endpoints (FS EPs) and back-side routing endpoints (BS EPs). BS EPs are defined by their utilization of back-side

Table 3: Features used in our GNN framework.

| Features              | Description                | Objective          |
|-----------------------|----------------------------|--------------------|
| timing slack          | Worst slack through EP     | timing criticality |
| total cell delay      | Worst path cell delay      | timing criticality |
| total net delay       | Worst path net delay       | timing criticality |
| path delay            | Worst path delay           | timing criticality |
| path depth            | Worst path logical depth   | path structure     |
| total fanout          | Worst path fanout          | path structure     |
| launch clock latency  | Worst path launch latency  | clock quality      |
| capture clock latency | Worst path capture latency | clock quality      |

Algorithm 1: Back-side net conversion flow

**Input:**

- $EP = \{ep_1, \dots, ep_n\}$ : back-side endpoints
- $G_{clock} = (V, E)$ : clock tree graph

**Output:** New clock tree  $G_{bs}^* = (V, E)$  that uses back-side nets

- 1: $c \leftarrow all\_fanin(ep_1, \dots, ep_n) : \{c_1, \dots, c_k\}$
- 2: $n \leftarrow get\_nets(c) : \{n_1, \dots, n_t\}$
- 3: $G_{bs}^* \leftarrow G_{clock}$
- 4: **for** $i = 1 \dots t$ **do**
  - 5: $l \leftarrow get\_cells(n_i) \cap c$
  - 6: $d \leftarrow \text{get driver cells from net } n_i$
  - 7: **if** l is buffer **and** d is buffer **then**
    - 8: $G^*_{bs} \sim change\_node(V_l, V_{l_{bs}})$ : change l to back-side
    - 9: $G^*_{bs} \sim change\_node(V_d, V_{d_{bs}})$ : change d to back-side
  - 10: **end if**
- 11: **end for**

routing in the clock path's trunk, whereas FS EPs follow the conventional route with all routing over front-side metal layers. Weighted K-means, an advanced form of standard K-means, accounts for the significance of different data points.

In the weighted k-means algorithm, each data point  $x_i$  is given a weight  $w_i$  to indicate its importance. The goal is to reduce the weighted sum of squared distances between points and their cluster centroids, allowing prioritization of more significant data points. The algorithm's objective is succinctly captured as:

$$\min \sum_{i=1}^{n} w_i \cdot \|x_i - c_{k_i}\|^2 \tag{4}$$

where  $c_{k_i}$  is the cluster centroid for  $x_i$ , and  $w_i$  is  $x_i$ 's weight.
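A minimal sketch of weighted K-means for Equation 4 follows. It alternates nearest-centroid assignment with weighted-mean centroid updates. The deterministic farthest-point initialization, the toy 2-D points (standing in for node embeddings), and the weights are assumptions for illustration; the text does not specify how the weights or initial centroids are chosen.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k=2, max_iters=100):
    """Lloyd-style iterations for min sum_i w_i * ||x_i - c_{k_i}||^2."""
    # Assumption: farthest-point initialization, starting from the first point.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sqdist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments, and therefore centroids, no longer change
        labels = new_labels
        # Update step: each centroid becomes the weighted mean of its members.
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            total = sum(weights[i] for i in members)
            if total > 0:
                centroids[j] = [
                    sum(weights[i] * points[i][d] for i in members) / total
                    for d in range(len(points[0]))
                ]
    return labels, centroids


points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 4.0]  # the last point counts four times as much
labels, centroids = weighted_kmeans(points, weights, k=2)
```

The heavier weight on the last point pulls the second centroid toward it, which is how more significant data points are prioritized.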

Once the weighted K-means algorithm stabilizes, that is, the centroids of all k clusters remain unchanged, we obtain the clustering outcome illustrated in Figure 6. Following the clustering of endpoints, the cluster with the greater average of worst negative slacks is chosen for back-side net servicing. Additionally, the best results are achieved by considering only the top  $\alpha$  (20-40%, varying by benchmark) of endpoints, ranked by worst slack. This approach enhances performance and notably decreases the number of clock nTSVs. Detailed results for varying  $\alpha$  values are omitted due to space limitations.
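The selection step can be sketched as follows. Two points are assumptions here: the chosen cluster is taken to be the one with the most negative mean slack, and the top  $\alpha$  fraction is taken within that cluster. The function name `select_bs_endpoints` and the slack values are hypothetical.

```python
import math


def select_bs_endpoints(slack_ps, labels, alpha=0.3):
    """Pick the more timing-critical cluster, then keep its worst-slack alpha fraction."""
    clusters = {}
    for ep, lab in labels.items():
        clusters.setdefault(lab, []).append(ep)
    # Assumption: the critical cluster is the one with the lowest (most negative) mean slack.
    critical = min(
        clusters.values(),
        key=lambda eps: sum(slack_ps[e] for e in eps) / len(eps),
    )
    ranked = sorted(critical, key=lambda e: slack_ps[e])  # most negative slack first
    keep = max(1, math.ceil(alpha * len(ranked)))
    return ranked[:keep]


slack_ps = {"ff_a": -40, "ff_b": -35, "ff_c": -10, "ff_d": -5, "ff_e": 12, "ff_f": 20}
labels = {"ff_a": 0, "ff_b": 0, "ff_c": 0, "ff_d": 0, "ff_e": 1, "ff_f": 1}
bs_eps = select_bs_endpoints(slack_ps, labels, alpha=0.5)
```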

#### 3.6 Back-side Buffer Insertion

Given the endpoints from Section 3.5 and the clock tree graph  $G_{clock} = (V, E)$, where V is the set of all cells in the clock delivery network and E is the set of edges in the clock tree, we back-trace from each endpoint to the clock source. An overview of the algorithm is shown in Algorithm 1. We back-trace all the nets from each endpoint to the clock port ($all\_fanin$) and keep the unique cells as list c. Next, we query all the nets connected to the cells in c and keep them as n. We then iterate over all nets in n, finding the load cell of each net to ensure that the correct net is moved to the back-side.

Figure 6: t-SNE plot of endpoints being separated into front-side and back-side counterparts in the ECG benchmark.

Figure 7: Normalized dWL/layer between GNN and FS-CDN. This shows that the GNN scheme uses the  $M_y$  (less resistive) layers over  $M_x$  due to the reduced signal congestion.

We find the correct load cell (l) by intersecting ($\cap$) the cells of net  $n_i$  with c. We then check whether the driver (d) and the load (l) of the net are both buffers and, if so, replace them with back-side buffers by changing each cell reference to a back-side buffer with the same drive strength. We lock the two cells and the net between them to ensure that the tool's optimizer does not replace or insert any cells. We iterate until all nets in n have been processed. With this approach, we ensure that the proper nets are transitioned to the back-side, enhancing the clock network's design and performance.
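The conversion flow can be sketched on a toy clock tree as follows. This is an illustrative Python model, not the script used inside the PnR tool: the netlist representation (`nets` mapping a net name to its driver and loads), the `cell_type` map, and the function name are assumptions. How a buffer that is both the load of one back-side net and the driver of the next is handled is not specified in the text, so the sketch simply records both roles.

```python
def convert_to_backside(nets, cell_type, bs_endpoints):
    """Return ({cell: set of back-side roles}, [nets moved to the back-side])."""
    # Map each load cell to the net that drives it, to walk toward the clock source.
    driven_by = {load: net for net, (_, loads) in nets.items() for load in loads}

    # Step 1: unique fan-in cells of all selected endpoints (list c in Algorithm 1).
    fanin = set()
    for ep in bs_endpoints:
        cell = ep
        while cell in driven_by:
            cell = nets[driven_by[cell]][0]
            fanin.add(cell)

    roles, bs_nets = {}, []
    for net, (driver, loads) in nets.items():
        # Loads of this net that lie on a path to a selected endpoint (cells of the net ∩ c).
        on_path = [load for load in loads if load in fanin]
        if driver not in fanin or not on_path:
            continue  # includes leaf nets, whose loads are endpoints rather than buffers
        if cell_type[driver] == "buffer" and all(cell_type[l] == "buffer" for l in on_path):
            roles.setdefault(driver, set()).add("BS-OUT")
            for load in on_path:
                roles.setdefault(load, set()).add("BS-IN")
            bs_nets.append(net)
    return roles, bs_nets


# Hypothetical clock tree: port -> buf_root -> {buf_a, buf_b} -> flip-flops.
nets = {
    "clk": ("clk_port", ["buf_root"]),
    "n1": ("buf_root", ["buf_a", "buf_b"]),
    "n2": ("buf_a", ["ff_1", "ff_2"]),
    "n3": ("buf_b", ["ff_3"]),
}
cell_type = {
    "clk_port": "port", "buf_root": "buffer", "buf_a": "buffer", "buf_b": "buffer",
    "ff_1": "ff", "ff_2": "ff", "ff_3": "ff",
}
roles, bs_nets = convert_to_backside(nets, cell_type, ["ff_1"])
```

For the selected endpoint `ff_1`, only the trunk net `n1` between `buf_root` and `buf_a` is converted; the port net and the leaf nets stay on the front-side.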

#### 3.7 Routing Layer Enforcement

After back-side buffer insertion, the nets between back-side buffer pairs are assigned to route on the back-side metal layers. To prevent signal routing from using the back-side layers, we remove the buried power rail routing tracks, which separates the back-side metal layers from the front-side metal layers. The back-side nets are constrained so that no cells can be inserted on them, which preserves the back-side routing. Moreover, we define another constraint that prohibits the use of back-side buffers during timing optimization, so that other nets are not resized to back-side buffers.



Figure 8: Clock tree layouts and structures of the JPEG benchmark using front-side vs. GNN-based back-side clock delivery technologies. Panel (b): GNN clock layout and clock structure.

## 4 RESULTS & DISCUSSIONS

#### 4.1 Experimental Setup

We evaluate our BS-CDN methodology on three benchmarks: ECG, JPEG, and the OpenPiton processor [8], as listed in Table 4. We use the 3 nm technology PDK [6] with up to 9 metal layers. For our tool setup, we use Synopsys Design Compiler for logic synthesis and ICC2 for place and route. We choose the initial PDN utilization based on an acceptable IR-drop ( $\leq$  10%). The back-side metals are shared between the power/ground (PG) grid and the clock nets. We compare the following design options:

- FS-CDN: commercial 2D design with auto settings.
- Baseline: all EPs whose slack falls below a negative-slack threshold are serviced by the back-side clock.
- GNN: only the EPs selected by the GNN algorithm are serviced by the back-side clock. This selection considers not only timing criticality but also the similarity between the EPs.

#### 4.2 Full-chip Metrics Analysis

We assessed the GNN design on traditional PPA metrics (wirelength, power, and timing) while ensuring that the IR-drop stayed within 10% of VDD, as shown in Table 4. The GNN design achieved up to a 13% increase in effective frequency and improved WNS and TNS, with a slight rise in power. Cell counts and wirelength were similar across designs, but the GNN design used the  $M_y$  layers more, which improved timing by easing congestion, as depicted in Figure 7. While the total power of the GNN design was marginally higher due to additional cells for timing corrections, it significantly reduced the nTSV count relative to Baseline by servicing only the top  $\alpha$  EPs after GNN clustering.
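The text does not state how the effective frequency in Table 4 is derived. A common convention, assumed here, is to extend the target clock period by the magnitude of the worst negative slack (WNS); this closely matches the reported values. The function name is hypothetical.

```python
def effective_frequency_ghz(target_ghz: float, wns_ps: float) -> float:
    """Effective frequency when the clock period is stretched by |WNS| (assumed definition)."""
    period_ps = 1000.0 / target_ghz
    # Only negative slack lengthens the period; positive slack is ignored.
    return 1000.0 / (period_ps - min(wns_ps, 0.0))


# (target frequency in GHz, WNS in ps) for the FS-CDN and GNN designs in Table 4.
for name, target, wns in [
    ("ECG FS-CDN", 2.0, -162), ("ECG GNN", 2.0, -154),
    ("JPEG GNN", 4.0, -25), ("OpenPiton FS-CDN", 1.1, -24),
]:
    print(f"{name}: {effective_frequency_ghz(target, wns):.2f} GHz")
```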

#### 4.3 Clock Metrics Analysis

Table 5 presents a summary of the clock metrics analysis of the ECG benchmark. The GNN design achieves the best WNS, with similar clock wirelength. Figure 8 compares the FS-CDN and GNN clock layouts and structures of the JPEG benchmark. The higher clock buffer count of the GNN design, which contributes to slightly increased clock power, is attributed to beneficial skew utilization. The GNN-based back-side clock routing selects endpoints efficiently, reducing the number of clock nTSVs compared to Baseline, where more endpoints use the back-side. However, more

Table 4: Full-chip metrics comparison between FS-CDN (front) and back-side counterparts for all design benchmarks.

| Metric               | ECG (2GHz) FS-CDN | ECG Baseline | ECG GNN (Δ%)  | JPEG (4GHz) FS-CDN | JPEG Baseline | JPEG GNN (Δ%) | OpenPiton (1.1GHz) FS-CDN | OpenPiton Baseline | OpenPiton GNN (Δ%) |
|----------------------|--------|----------|---------------|--------|----------|--------------|--------|----------|--------------|
| Eff. Freq (GHz)      | 1.51   | 1.51     | 1.53 (+13%)   | 3.35   | 3.45     | 3.64 (+8.6%) | 1.07   | 1.08     | 1.1 (+2.8%)  |
| # metals (back+front)| 0+8    | 2+8      | 2+8           | 0+9    | 2+9      | 2+9          | 0+9    | 2+9      | 2+9          |
| # Cell               | 125.1K | 123.4K   | 124K          | 368K   | 368K     | 364K         | 332K   | 323K     | 354K         |
| # Clock nTSV         | -      | 1343     | 208 (-84.5%)  | -      | 452      | 150 (-66.8%) | -      | 2000     | 78 (-96.1%)  |
| Wirelength (m)       | 0.24   | 0.23     | 0.237 (-1.2%) | 0.74   | 0.73     | 0.72 (+2.7%) | 1.26   | 1.26     | 1.27 (-0.7%) |
| Total power (mW)     | 457    | 459      | 464 (-1.5%)   | 592    | 600      | 593 (-0.1%)  | 873    | 861      | 887 (-1.6%)  |
| Worst neg slack (ps) | -162   | -163     | -154 (-4.9%)  | -49    | -40      | -25 (-48.9%) | -24    | -16      | -2 (-88%)    |
| Total neg slack (ns) | -1248  | -1115    | -1187 (-4.8%) | -380   | -428     | -351 (-7.6%) | -90    | -28      | -8.5 (-71%)  |
| IR-drop (% Vdd)      | 8.04%  | 8.1%     | 8.28%         | 4.24%  | 4.36%    | 3.6%         | 0.9%   | 0.8%     | 1%           |

ECG and JPEG are pure-logic designs; OpenPiton is a processor design. Back-side PDN settings per benchmark:

| Back-side PDN           | ECG              | JPEG             | OpenPiton         |
|-------------------------|------------------|------------------|-------------------|
| MB1 - width/pitch/util. | 0.17μm/6.0μm/18% | 0.17μm/5.0μm/18% | 0.17μm/5.0μm/34%  |
| MB2 - width/pitch/util. | 1.5μm/8.0μm/3%   | 1.5μm/10.0μm/2%  | 1.5μm/10.0μm/4.2% |

Table 5: Clock metrics comparison between FS-CDN and back-side clock counterparts in the ECG benchmark (pure-logic design, 2GHz).

| Metric                  | FS-CDN | Baseline | GNN (Δ%)      |
|-------------------------|--------|----------|---------------|
| Clock WL (mm)           | 10.1   | 10.8     | 10.6 (+4.9%)  |
| # Clock buffers         | 2357   | 2182     | 2465 (+4.5%)  |
| Clock buffer area (um²) | 96.4   | 100.1    | 105 (+8.9%)   |
| Clock Pwr (mW)          | 26.5   | 26.8     | 27.2 (+2.6%)  |
| Clock Skew (ps)         | 249.4  | 229.4    | 228.5 (-8.3%) |
| Clock Latency (ps)      | 279.1  | 264.6    | 268.6 (-3.7%) |
| Worst neg slack (ps)    | -161   | -163     | -154 (-4.9%)  |
Table 6: Critical timing path comparison between FS-CDN (front) & GNN-based BS-CDN (GNN).

| OpenPiton (1.1GHz)       | FS-CDN | GNN (Δ%)      |
|--------------------------|--------|---------------|
| Timing Slack (ps)        | -28    | -2 (-92.8%)   |
| Cell delay (ps)          | 626    | 666 (+6.39%)  |
| Net delay (ps)           | 261    | 207 (-20.69%) |
| Launch clock delay (ps)  | 477    | 493 (+3.35%)  |
| Capture clock delay (ps) | 418    | 435 (+4.07%)  |



Figure 9: FS-CDN vs GNN critical path layouts.

back-side clock nets do not always yield better results, which highlights the effectiveness of the GNN in selecting endpoints for back-side routing.

#### 4.4 Timing Critical Path Analysis

We observe the critical paths in the full-chip designs implemented with the FS-CDN and GNN approaches in Table 6 and Figure 9.

The improvement of the critical path in the GNN design over the FS-CDN design is attributed to a roughly 20% lower net delay (Table 6), which results from higher metal layer utilization due to decreased signal congestion.

## 5 CONCLUSION

We propose the BS-CDN methodology, an approach that enhances performance in advanced technologies by routing part of the clock tree through the back-side BEOL, thereby freeing front-side space for signal routing. Naively servicing more endpoints with the back-side clock does not yield optimal results. We therefore adopt a GNN-based methodology that learns not only timing criticality but also endpoint similarity when assigning the back-side clock. We show that an unsupervised GNN method can select the trunk nets to be routed on the back-side and improves on naive back-side clock routing. With reasonable IR-drop numbers, our methodology improves clock and full-chip metrics through better skew utilization and decreased signal congestion.

## 6 ACKNOWLEDGEMENT

This work was supported by the Semiconductor Research Corporation under the JUMP 2.0 Center Program (CHIMES 3136.002) and Samsung Advanced Institute of Technology (SAIT) under the AI for Semiconductors Program.

## REFERENCES

- [1] A. Veloso et al. Insights into Scaled Logic Devices Connected from Both Wafer Sides. In 2022 IEDM, pages 23.3.1–23.3.4, 2022.
- [2] A. Jourdain et al. Buried Power Rails and Nano-Scale TSV: Technology Boosters for Backside Power Delivery Network and 3D Heterogeneous Integration. In 2022 IEEE 72nd ECTC.
- [3] M. Shamanna et al. E-Core Implementation in Intel 4 with PowerVia (Backside Power) Technology. In 2023 VLSI Technology and Circuits, pages 1–2, 2023.
- [4] R. Chen et al. Opportunities of Chip Power Integrity and Performance Improvement through Wafer Backside (BS) Connection: Invited Paper. In Proceedings of the 24th ACM/IEEE Workshop on System Level Interconnect Pathfinding, 2023.
- [5] A. Jourdain et al. Extreme Wafer Thinning and Nano-TSV Processing for 3D Heterogeneous Integration. In 2020 ECTC, pages 42–48, 2020.
- [6] S. M. Shaji et al. A Comparative Study on Front-Side, Buried and Back-Side Power Rail Topologies in 3nm Technology Node. In ISLPED 2023.
- [7] W. Hamilton et al. Inductive Representation Learning on Large Graphs. Advances in Neural Information Processing Systems, 30, 2017.
- [8] J. Balkind et al. OpenPiton: An Open Source Manycore Research Framework. In Proceedings of the 21st ASPLOS 2016.